In all my years of SEO consulting, I've seen many clients with wild misconceptions about XML sitemaps. They're a powerful tool, for sure - but like any power tool, a little training and background on how all the bits work goes a long way.
Indexation
Probably the most common misconception is that the XML sitemap helps get your pages indexed. The first thing we've got to get straight is this: Google does not index your pages just because you asked nicely. Google indexes pages because (a) they found them and crawled them, and (b) they consider them good enough quality to be worth indexing. Pointing Google at a page and asking them to index it doesn't really factor into it.
Having said that, it is important to note that by submitting an XML sitemap to Google Search Console, you're giving Google a clue that you consider the pages in the XML sitemap to be good-quality search landing pages, worthy of indexation. But it's just a clue that the pages are important… like linking to a page from your main menu is.
Consistency
One of the most common mistakes I see clients make is a lack of consistency in their messaging to Google about a given page. If you block a page in robots.txt and then include it in an XML sitemap, you're being a tease. “Here, Google… a nice, juicy page you really ought to index,” your sitemap says. But then your robots.txt takes it away. Same thing with meta robots: don't include a page in an XML sitemap and then set meta robots to “noindex,follow.”
While I'm at it, let me rant briefly about meta robots: “noindex” means don't index the page. “Nofollow” means nothing about indexing that page; it means “don't follow the links outbound from that page” - i.e., go ahead and flush all that link juice down the toilet. There's probably some obscure reason out there for setting meta robots to “noindex,nofollow,” but it's beyond me what that might be. If you want Google not to index a page, set meta robots to “noindex,follow.”
OK, rant over…
In general, then, you want every page on your site to fall into one of two buckets:
- Utility pages (useful to users, but not anything you'd expect to be a search landing page)
- Yummy, high-quality search landing pages
Everything in bucket #1 should either be blocked by robots.txt or blocked via meta robots “noindex,follow” and should not be in an XML sitemap.
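For the bucket-#1 pages you leave crawlable, that means a tag like this in each page's `<head>` (this is the standard robots meta tag; the exact placement is up to your templates):

```html
<meta name="robots" content="noindex,follow">
```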
Everything in bucket #2 should not be blocked in robots.txt, should not have meta robots “noindex,” and probably should be in an XML sitemap.
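The bucket-#2 pages then go into a plain sitemap in the sitemaps.org format - nothing more than a `<urlset>` of `<loc>` entries is required. The example.com URLs here are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/great-content-page/</loc></url>
  <url><loc>https://www.example.com/another-landing-page/</loc></url>
</urlset>
```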
Overall site quality
It would appear that Google is taking some measure of overall site quality, and using that site-wide metric to impact rankings - and I'm not talking about link juice here.
Think about this from Google's perspective. Let's say you've got one great page full of fabulous content that ticks all the boxes, from relevance to Panda to social media engagement. If Google sees your site as 1,000 pages of content, of which only 5-6 pages are like this one great page… well, if Google sends a user to one of those great pages, what's the user experience going to be like if they click a link on that page and visit something else on your site? Chances are, they're going to land on a page that sucks. That's bad UX. Why would Google want to send a user to a site like that?
Google engineers certainly understand that every site has a certain number of “utility” pages that are useful to users, but not necessarily content-type pages that should be landing pages from search: pages for sharing content with others, replying to comments, logging in, retrieving a lost password, etc.
If your XML sitemap includes all of these pages, what are you communicating to Google? More or less that you have no clue as to what constitutes good content on your site and what doesn’t.
Here's the picture you want to paint for Google instead: yes, we have a site here with 1,000 pages… and here are the 475 of those 1,000 that are our great content pages. You can ignore the others - they're utility pages.
Now, let's say Google crawls those 475 pages and, with their metrics, decides that 175 of those are “A” grade, 200 are “B+,” and 100 are “B” or “B-.” That's a pretty good overall average, and it probably indicates a pretty solid site to send users to.
Contrast that with a site that submits all 1,000 pages via the XML sitemap. Now Google looks at the 1,000 pages you say are good content and sees that over 50% are “D” or “F” pages. On average, your site is pretty sucky; Google probably doesn't want to send users to a site like that.
The hidden fluff
Remember, Google is going to use what you submit in your XML sitemap as a clue to what's probably important on your site. But just because a page isn't in your XML sitemap doesn't necessarily mean that Google will ignore it. You could still have many thousands of pages with barely enough content and link equity to get them indexed - pages that really shouldn't be indexed at all.
It's important to do a site: search to see all the pages that Google is indexing from your site, in order to discover pages you forgot about. Then clean those out of the “average grade” Google is going to give your site by setting meta robots to “noindex,follow” (or by blocking them in robots.txt). Generally, the weakest pages that still made the index will be listed last in a site: search.
Noindex vs. robots.txt
There's an important but subtle difference between using meta robots and using robots.txt to prevent indexation of a page. Using meta robots “noindex,follow” allows the link equity flowing into that page to pass on to the pages it links to. If you block the page with robots.txt, you're just flushing that link equity down the toilet.
In the example above, I'm blocking pages that aren't real pages - they're tracking scripts - so I'm not losing link equity, as these pages DO NOT have the header with the main menu links, etc.
Think of a page like a Contact Us page, or a Privacy Policy page - probably linked to by every single page on your site via either the main menu or the footer menu. So there's a ton of link juice going to those pages; do you just want to throw that away? Or would you rather let that link equity flow out to everything in your main menu? Easy question to answer, isn't it?
Crawl bandwidth management
When might you actually want to use robots.txt instead? Perhaps when you're having crawl bandwidth issues and Googlebot is spending lots of time fetching utility pages, only to discover meta robots “noindex,follow” in them and having to bail out. If you've got so many of these that Googlebot isn't getting to your important pages, then you may have to block them via robots.txt.
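In that case, a couple of robots.txt rules shut the door before Googlebot ever fetches those URLs. The paths below are hypothetical - match them to your own URL patterns:

```
User-agent: *
Disallow: /reply-to-comment/
Disallow: /share-via-email/
```

Just remember the trade-off from the previous section: anything blocked this way can no longer pass its inbound link equity onward.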
I've seen a number of clients see ranking improvements across the board after cleaning up their XML sitemaps and noindexing their utility pages.
Ask yourself: do I really have 6,000 to 20,000 pages that need crawling daily? Or is Googlebot chasing reply-to-comment or share-via-email URLs?
FYI, suppose you've got a core set of pages whose content changes regularly (like a blog, new products, or product category pages), plus a ton of pages (like single product pages) where it'd be nice if Google indexed them, but not at the expense of Google failing to re-crawl and index the core pages. In that case, submit just the core pages in an XML sitemap, to give Google a clue that you consider them more important than the ones that aren't blocked but aren't in the sitemap.
Indexation problem debugging
Here's where the XML sitemap is really useful to SEOs: when you're submitting a bunch of pages to Google for indexing, and only some of them are actually getting indexed. Google Search Console won't tell you which pages they're indexing, only an overall number indexed in each XML sitemap.
Let's say you're an e-commerce site and you have 100,000 product pages, 5,000 category pages, and 20,000 subcategory pages. You submit your XML sitemap of 125,000 pages, and find out that Google is indexing 87,000 of them. But which 87,000?
First off, your category and subcategory pages are probably ALL important search targets for you. I'd create a category-sitemap.xml and a subcategory-sitemap.xml and submit those separately. You're expecting to see near-100% indexation there - and if you're not getting it, then you know you need to look at building out more content on those pages, increasing link juice to them, or both. You might discover something like product category or subcategory pages that aren't getting indexed because they have only 1 product in them (or none at all) - in which case you probably want to set meta robots “noindex,follow” on those, and pull them from the XML sitemap.
Chances are, the problem lies in some of the 100,000 product pages - but which ones?
Start with a hypothesis, and split your product pages into different XML sitemaps to test those hypotheses. You can do several at once - nothing wrong with having a URL exist in multiple sitemaps.
You might start with three theories:
- Pages that don't have a product image aren't getting indexed
- Pages that have less than 200 words of unique description aren't getting indexed
- Pages that don't have comments/reviews aren't getting indexed
Create an XML sitemap with a meaningful number of pages that fall into each of those categories. It doesn't need to include all pages in that category - just enough that the sample size makes it reasonable to draw a conclusion based on the indexation. You might do 100 pages in each, for instance.
Your goal here is to use the overall percent indexation of any given sitemap to identify the attributes of pages that are causing them to get indexed or not.
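One way to generate those test sitemaps is a short script that applies each hypothesis as a filter over your product data. This is only a sketch: the field names (`image_count`, `description_words`, `review_count`) and the sitemap filenames are hypothetical stand-ins for whatever your catalog actually exposes.

```python
# Sketch: build one test sitemap per hypothesis about why product
# pages aren't getting indexed. Field names are hypothetical.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Render a minimal sitemaps.org-format sitemap for the given URLs."""
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>")

# One predicate per theory; a page may match several sitemaps (that's fine).
HYPOTHESES = {
    "no-image-sitemap.xml":         lambda p: p["image_count"] == 0,
    "thin-description-sitemap.xml": lambda p: p["description_words"] < 200,
    "no-reviews-sitemap.xml":       lambda p: p["review_count"] == 0,
}

def build_test_sitemaps(products, sample_size=100):
    """Return {filename: sitemap_xml}, sampling up to sample_size URLs each."""
    return {
        name: build_sitemap([p["url"] for p in products if pred(p)][:sample_size])
        for name, pred in HYPOTHESES.items()
    }
```

Submit each file separately in Search Console; a sitemap whose indexation percentage sits far below the others points a finger at its hypothesis.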
Once you know what the problem is, you can either modify the page content (or the links to the pages), or noindex the pages. For example, you might have 20,000 of your 100,000 product pages where the product description is less than 50 words. If these aren't big-traffic terms and you're getting the descriptions from a manufacturer's feed, it's probably not worth your while to manually write an additional 200 words of description for each of those 20,000 pages. You might as well set meta robots to “noindex,follow” for all pages with less than 50 words of product description, since Google isn't going to index them anyway and they're just bringing down your overall site quality rating. And don't forget to remove those pages from your XML sitemap.
Dynamic XML sitemaps
Now you're thinking, “OK, great, Michael. But now I've got to manually keep my XML sitemap in sync with my meta robots on all of my 100,000 pages,” and that's not likely to happen.
But there's no need to do this manually. XML sitemaps don't have to be static files. In fact, they don't even need to have a .XML extension to be submitted in Google Search Console.
Instead, set up rules logic for whether a page gets included in the XML sitemap or not, and use that same logic in the page itself to set meta robots to index or noindex. That way, the moment the product description from the manufacturer's feed gets updated by the manufacturer and goes from 42 words to 215 words, that page on your site magically shows up in the XML sitemap and gets its meta robots set to “index,follow.”
On my travel website, I do this for a ton of different kinds of pages. I'm using classic ASP for those pages, so my sitemaps are server-side scripts rather than static files.
When these sitemaps are fetched, instead of rendering an HTML page, the server-side code simply spits back the XML. Each one iterates over a set of records from one of my database tables and spits out a record for each one that meets certain criteria.
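In rough Python terms (the author's site uses classic ASP; this is an equivalent sketch, with a hypothetical 200-word description threshold as the inclusion rule), the key idea is that one shared predicate drives both the sitemap and each page's meta robots tag, so the two can never drift apart:

```python
# Sketch: one shared rule decides both sitemap inclusion and meta robots.
from xml.sax.saxutils import escape

def is_index_worthy(page):
    # Hypothetical rule: only pages with a substantial description qualify.
    return len(page["description"].split()) >= 200

def render_sitemap(pages):
    """Serve this from a URL endpoint instead of a static .xml file."""
    entries = "\n".join(
        f"  <url><loc>{escape(p['url'])}</loc></url>"
        for p in pages if is_index_worthy(p)
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>")

def meta_robots_tag(page):
    """Emit this in each page's <head>, driven by the same rule."""
    content = "index,follow" if is_index_worthy(page) else "noindex,follow"
    return f'<meta name="robots" content="{content}">'
```

The moment a description crosses the threshold, the next fetch of the sitemap includes the page, and the page itself flips to “index,follow” - no manual syncing.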
Video sitemaps
Oh, and what about those pesky video XML sitemaps? They’re so 2015. Wistia doesn’t even bother generating them anymore; you should just be using JSON-LD and schema.org/VideoObject markup in the page itself.
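A minimal VideoObject block looks like the following; all values are placeholders, and name, description, thumbnailUrl, and uploadDate are the core properties Google's video documentation calls for:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Example product demo",
  "description": "A short demo of the example widget.",
  "thumbnailUrl": "https://www.example.com/thumbs/demo.jpg",
  "uploadDate": "2017-01-15",
  "contentUrl": "https://www.example.com/videos/demo.mp4"
}
</script>
```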
Summary
- Be consistent - if it's blocked in robots.txt or by meta robots “noindex,” then it had better not be in your XML sitemap.
- Use your XML sitemaps as sleuthing tools to discover and eliminate indexation problems, and only let/ask Google to index the pages you know Google is going to want to index.
- If you've got a big site, use dynamic XML sitemaps - don't try to manually keep all this in sync between robots.txt, meta robots, and the XML sitemaps.